A Metric Index for Approximate String Matching

Authors

  • Edgar Chávez
  • Gonzalo Navarro
Abstract

We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This allows us to find the $\mathit{occ}$ occurrences of a pattern of length $m$, permitting up to $r$ differences, in a text of length $n$ over an alphabet of size $\sigma$, in average time $O(m^{1+\epsilon} + \mathit{occ})$ for any $\epsilon > 0$, if $r = o(m / \log_\sigma m)$ and $m > \frac{1+\epsilon}{\epsilon} \log_\sigma n$. The index works well up to $r < (3 - \sqrt{2})\, m / \log_\sigma m$, where it achieves its maximum average search complexity $O(m^{1+\sqrt{2}+\epsilon} + \mathit{occ})$. The construction time of the index is $O(m^{1+\sqrt{2}+\epsilon}\, n \log n)$ and its space is $O(m^{1+\sqrt{2}+\epsilon}\, n)$. This is the first index achieving average search time polynomial in $m$ and independent of $n$, for $r = O(m / \log_\sigma m)$. Previous methods achieve this complexity only for $r = O(m / \log_\sigma n)$. We also present a simpler scheme needing $O(n)$ space.
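
The idea of treating an approximate query as a proximity query in a metric space can be illustrated with a small sketch. The code below is not the index described in the abstract (which uses suffix tree nodes as sites); it is a generic pivot-based filter over a plain list of strings, assuming the Levenshtein edit distance as the metric. The names `build_pivot_table` and `range_query`, and the choice of pivots, are hypothetical and used for illustration only. It shows how the triangle inequality lets a search discard a candidate without computing its distance to the query.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_pivot_table(sites, pivots):
    """Precompute the distance from every site to every pivot (hypothetical helper)."""
    return {s: [edit_distance(s, p) for p in pivots] for s in sites}


def range_query(query, r, sites, pivots, table):
    """Return the sites within distance r of query, plus how many were verified.

    For any pivot p, the triangle inequality gives
    |d(query, p) - d(site, p)| <= d(query, site),
    so a site whose pivot distances differ by more than r cannot match.
    """
    q_to_pivots = [edit_distance(query, p) for p in pivots]
    results = []
    verified = 0
    for s in sites:
        if any(abs(dq - ds) > r for dq, ds in zip(q_to_pivots, table[s])):
            continue  # discarded without computing d(query, s)
        verified += 1
        if edit_distance(query, s) <= r:
            results.append(s)
    return results, verified


# Illustrative data only; these strings and pivots are made up for the example.
sites = ["survey", "surgery", "sunday", "saturday", "metric", "matrix"]
pivots = ["survey", "metric"]
table = build_pivot_table(sites, pivots)
matches, verified = range_query("surgery", 2, sites, pivots, table)
print(matches, "verified", verified, "of", len(sites))
```

The filter never produces false negatives because the pivot bound is a lower bound on the true distance; surviving candidates are still checked directly, so the result equals a brute-force scan.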

Similar resources

Approximate String Matching

We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This allows us to find the R occurrences of ...

Finding Approximate Matches in Large Lexicons

Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures an...

An empirical evaluation of a metric index for approximate string matching

In this paper, we evaluate a metric index for the approximate string matching problem based on suffix trees, proposed by Gonzalo Navarro and Edgar Chávez [9]. Suffix trees are used during index construction to generate the intermediate data (a pivot table) to be indexed, and during query processing. One of the main problems with suffix trees is their space requirements. To address this, we propo...

Fast approximate string matching with finite automata

We present a fast algorithm for finding approximate matches of a string in a finite-state automaton, given some metric of similarity. The algorithm can be adapted to use a variety of metrics for determining the distance between two words.

n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching

Approximate string matching is to find all the occurrences of a query string in a text database allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used. A major reason is that it is scalable for large databases since it is not a main memory algorithm. Nevertheless, n-gram Matching also has drawbacks: th...

Publication date: 2002